APPARATUS AND METHOD FOR SELECTIVELY OVERRIDING
RETURN STACK PREDICTION IN RESPONSE TO DETECTION
OF NON-STANDARD RETURN SEQUENCE

by 

Thomas C. McDonald 
G. Glenn Henry 

PRIORITY INFORMATION 

[0001] This application claims priority based on U.S.
Provisional Application, Serial No. , filed
September 8, 2003, entitled APPARATUS AND METHOD FOR
OVERRIDING RETURN STACK PREDICTION IN RESPONSE TO DETECTION
OF NON-STANDARD RETURN.

FIELD OF THE INVENTION 

[0002] This invention relates in general to the field of 
branch prediction in microprocessors and particularly to 
return instruction target address prediction using return 
stacks and branch target address caches. 

BACKGROUND OF THE INVENTION 

[0003] A microprocessor is a digital device that 
executes instructions specified by a computer program. 
Modern microprocessors are typically pipelined. That is, 
they operate on several instructions at the same time, 
within different blocks or pipeline stages of the 
microprocessor. Hennessy and Patterson define pipelining
as "an implementation technique whereby multiple
instructions are overlapped in execution." Computer
Architecture: A Quantitative Approach, 2nd edition, by John
L. Hennessy and David A. Patterson, Morgan Kaufmann
Publishers, San Francisco, CA, 1996. They go on to provide
the following excellent illustration of pipelining:

A pipeline is like an assembly line. In an automobile 
assembly line, there are many steps, each contributing 
something to the construction of the car. Each step 
operates in parallel with the other steps, though on a 
different car. In a computer pipeline, each step in 
the pipeline completes a part of an instruction. Like 
the assembly line, different steps are completing 
different parts of the different instructions in 
parallel. Each of these steps is called a pipe stage 
or a pipe segment. The stages are connected one to 
the next to form a pipe - instructions enter at one 
end, progress through the stages, and exit at the 
other end, just as cars would in an assembly line. 

Docket CNTR.2231

[0004] Microprocessors operate according to clock 
cycles. Typically, an instruction passes from one stage of 
the microprocessor pipeline to another each clock cycle. 
In an automobile assembly line, if the workers in one stage 
of the line are left standing idle because they do not have 
a car to work on, then the production, or performance, of 
the line is diminished. Similarly, if a microprocessor 
stage is idle during a clock cycle because it does not have 
an instruction to operate on - a situation commonly 
referred to as a pipeline bubble - then the performance of 
the processor is diminished. 

[0005] A potential cause of pipeline bubbles is branch 
instructions. When a branch instruction is encountered, 
the processor must determine the target address of the 
branch instruction and begin fetching instructions at the 
target address rather than the next sequential address 
after the branch instruction. Because the pipeline stages
that definitively determine the target address come well
after the stages that fetch the instructions, branch
instructions create bubbles. As discussed more below,
microprocessors typically include branch prediction
mechanisms to reduce the number of bubbles created by
branch instructions.

[0006] One particular type of branch instruction is a 
return instruction. A return instruction is typically the 
last instruction executed by a subroutine for the purpose 
of restoring program flow to the calling routine, which is 
the routine that caused program control to be given to the 
subroutine. In a typical program sequence, the calling 
routine executes a call instruction. The call instruction 
instructs the microprocessor to push a return address onto 
a stack in memory and then to branch to the address of the 
subroutine. The return address pushed onto the stack is 
the address of the instruction that follows the call 
instruction in the calling routine. The subroutine
ultimately executes a return instruction, which pops off the
stack the return address previously pushed by the call
instruction and branches to that return address, which is
therefore the target address of the return instruction. An
example of a return instruction is the x86 RET instruction.
An example of a call instruction is the x86 CALL
instruction.

[0007] An advantage of call/return sequences is that
they allow subroutine calls to be nested. For example, a
main routine may call subroutine A, which pushes a return
address; subroutine A may then call subroutine B, which
pushes another return address; subroutine B then executes a
return instruction that pops the return address pushed when
subroutine A called it; and subroutine A then executes a
return instruction that pops the return address pushed when
the main routine called it. Nesting subroutine calls is
very useful, and the example above may be extended to as
many levels of calls as the stack size can support.

[0008] Because of the regular nature of call/return 
instruction sequences, modern microprocessors employ a 
branch prediction mechanism commonly referred to as a 
return stack to predict the target addresses of return 
instructions. The return stack is a small buffer that 
caches return addresses in a last-in-first-out manner. 
Each time a call instruction is encountered, the return 
address to be pushed onto the memory stack is also pushed 
onto the return stack. Each time a return instruction is 
encountered, the return address at the top of the return 
stack is popped and used as the predicted target address of 
the return instruction. This operation reduces bubbles, 
since the microprocessor does not have to wait for the 
return address to be fetched from the memory stack. 
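
As an illustration only, the return stack described in paragraph [0008] can be modeled in a few lines of Python. The class and method names are hypothetical, and the depth of 16 entries and the policy of discarding the oldest entry on overflow are assumptions, since no size or overflow behavior is specified here. The usage at the end replays the nesting example of paragraph [0007].

```python
class ReturnStack:
    """Minimal software model of a hardware return stack (hypothetical)."""

    def __init__(self, depth=16):
        # depth is an assumed size; the specification does not give one
        self.depth = depth
        self.entries = []

    def on_call(self, return_address):
        # A call pushes its return address, mirroring the push onto the memory stack.
        if len(self.entries) == self.depth:
            del self.entries[0]  # assumed policy: discard the oldest entry when full
        self.entries.append(return_address)

    def on_return(self):
        # A return pops the top entry and uses it as the predicted target address.
        # An empty stack yields no prediction (None).
        return self.entries.pop() if self.entries else None


# Nested calls from paragraph [0007]: main calls A, and A calls B.
rs = ReturnStack()
rs.on_call(0x1005)          # hypothetical return address in the main routine
rs.on_call(0x2010)          # hypothetical return address in subroutine A
print(hex(rs.on_return()))  # B's RET is predicted to return into A
print(hex(rs.on_return()))  # A's RET is predicted to return into main
```

Because entries are popped in last-in-first-out order, B's return is predicted to 0x2010 and A's return to 0x1005, matching the nesting of paragraph [0007].
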
[0009] Return stacks typically predict return
instruction target addresses very accurately due to the
regular nature of call/return sequences. However, the
present inventors have discovered that certain programs,
such as certain operating systems, do not always execute
call/return instructions in the standard fashion. For
example, code executing on an x86 microprocessor may
include a CALL, then a PUSH to place a different return
address on the stack, then a RET, which causes a return to
the pushed return address rather than to the address of the
instruction after the CALL, which was pushed onto the stack
by the CALL. In another example, the code performs a PUSH
to place a return address on the stack, then performs a
CALL, then performs two RET instructions; the second RET
returns to the pushed return address rather than to the
instruction after a CALL that preceded the PUSH. In both
cases a return stack mispredicts the target address of the
RET, because the return stack is updated only by call and
return instructions and therefore does not reflect the PUSH.
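
The following sketch, which is illustrative rather than a description of any particular hardware, replays the two sequences of paragraph [0009] against a memory stack and a return stack side by side. The function name and the addresses 0x0105, 0x1005, and 0x3000 are hypothetical. The point it shows is that a PUSH modifies the memory stack but never reaches the return stack.

```python
def replay(trace):
    """Replay (op, address) pairs; return (predicted, actual) for each RET."""
    memory_stack = []  # architectural stack in memory
    return_stack = []  # hardware return stack, which only sees CALL and RET
    outcomes = []
    for op, address in trace:
        if op == "CALL":
            # address is the return address (the instruction after the CALL)
            memory_stack.append(address)
            return_stack.append(address)
        elif op == "PUSH":
            # A PUSH changes the memory stack but is invisible to the return stack.
            memory_stack.append(address)
        elif op == "RET":
            predicted = return_stack.pop() if return_stack else None
            actual = memory_stack.pop()
            outcomes.append((predicted, actual))
    return outcomes


# First example: CALL, PUSH, RET.
print(replay([("CALL", 0x1005), ("PUSH", 0x3000), ("RET", None)]))
# Second example: an earlier CALL, then PUSH, CALL, RET, RET.
print(replay([("CALL", 0x0105), ("PUSH", 0x3000), ("CALL", 0x1005),
              ("RET", None), ("RET", None)]))
```

In the first trace, the RET is predicted to go to 0x1005 but actually goes to 0x3000. In the second trace, the first RET is predicted correctly, while the second RET is predicted to go to 0x0105 but actually goes to 0x3000.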

[0010] Therefore, what is needed is an apparatus for 
more accurately predicting a return instruction target 
address, particularly for code that executes a non-standard 
call/return sequence.

SUMMARY 

[0011] The present invention provides an apparatus for 
detecting a return stack misprediction and responsively 
setting an override flag associated with the return 
instruction so that upon the next occurrence of the return 
instruction the microprocessor can predict the return 
instruction target address by a mechanism other than the 
return stack. A branch target address cache (BTAC) is 
employed to store the override flag associated with the 
return instruction. In one embodiment, the other mechanism 
for predicting the return instruction target address is the 
BTAC, which is typically less accurate than the return
stack at predicting a return instruction target address for
a standard call/return sequence, but more accurate for code
that executes a non-standard call/return sequence.

[0012] In one aspect, the present invention provides a
microprocessor. The microprocessor includes a return stack
that provides a first prediction of a target address of a
return instruction. The microprocessor also includes a
branch target address cache (BTAC) that provides a second
prediction of the target address of the return instruction
and an override indicator. The override indicator
indicates a predetermined value if the first prediction
mispredicted the target address for a first instance of the
return instruction. The microprocessor also includes
branch control logic, coupled to the return stack and the
BTAC, which causes the microprocessor to branch to the
second prediction of the target address, and not to the
first prediction, for a second instance of the return
instruction, if the override indicator indicates the
predetermined value.

[0013] In another aspect, the present invention provides 
an apparatus for improving branch prediction accuracy in a 
microprocessor having a branch target address cache (BTAC) 
and a return stack that each generate a prediction of a 
target address of a return instruction. The apparatus 
includes an override indicator. The apparatus also 
includes update logic, coupled to the override indicator, 
which updates the override indicator to a true value if the 
prediction generated by the return stack mispredicted the 
target address of a first occurrence of the return 
instruction. The apparatus also includes branch control 
logic, coupled to the override indicator, which selects the 
prediction generated by the BTAC for a second occurrence of 
the return instruction, rather than selecting the 
prediction generated by the return stack, if the override 
indicator is true. 
[0014] In another aspect, the present invention provides
a method for predicting a target address of a return 
instruction in a microprocessor. The method includes 
updating an override indicator to a true value in response 
to a return stack mispredicting the target address of the 
return instruction. The method also includes a branch 
target address cache (BTAC) generating a prediction of the 
target address subsequent to the updating. The method 
includes determining whether the override indicator has a 
true value after the BTAC generates the prediction. The 
method includes branching the microprocessor to the 
prediction generated by the BTAC if the override indicator 
has a true value. 
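
A minimal sketch of the method of paragraph [0014] for a single return instruction follows. It assumes the BTAC caches the most recently resolved target of that return instruction, which is consistent with the caching of newly resolved return instructions described later but is an assumption here. The class name ReturnOverride is hypothetical.

```python
class ReturnOverride:
    """Per-return-instruction state for the method of [0014] (hypothetical)."""

    def __init__(self):
        self.override = False    # override indicator
        self.btac_target = None  # target cached by the BTAC (assumed updated on resolution)

    def predict(self, return_stack_target):
        # The BTAC generates its prediction; branch to it if the override indicator is true.
        if self.override and self.btac_target is not None:
            return self.btac_target
        return return_stack_target

    def resolve(self, return_stack_target, actual_target):
        # Update the override indicator to true if the return stack mispredicted.
        if return_stack_target != actual_target:
            self.override = True
        self.btac_target = actual_target


ret = ReturnOverride()
for occurrence in (1, 2):
    rs_guess = 0x1005  # the return stack still predicts the address after the CALL
    print(occurrence, hex(ret.predict(rs_guess)))
    ret.resolve(rs_guess, actual_target=0x3000)
```

The first occurrence follows the return stack and mispredicts; its resolution sets the override indicator, so the second occurrence branches to the BTAC prediction, 0x3000.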

[0015] In another aspect, the present invention provides
an apparatus for improving branch prediction accuracy in a
microprocessor having a return stack and an alternate
prediction apparatus that each generate a prediction of a
target address of a return instruction, and a branch target
address cache (BTAC). The apparatus includes an override
indicator, provided by the BTAC. The apparatus also
includes update logic, coupled to the override indicator, 
which updates the override indicator in the BTAC to a true 
value if the prediction generated by the return stack 
mispredicted the target address of a first occurrence of 
the return instruction. The apparatus also includes branch 
control logic, coupled to the override indicator, which 
selects the prediction generated by the alternate 
prediction apparatus for a second occurrence of the return 
instruction, rather than selecting the prediction generated 
by the return stack, if the override indicator is true. 
[0016] In another aspect, the present invention provides
a computer data signal embodied in a transmission medium,
comprising computer-readable program code for providing a
microprocessor. The program code includes first program
code for providing a return stack, for providing a first
prediction of a target address of a return instruction.
The program code also includes second program code for
providing a branch target address cache (BTAC), for
providing a second prediction of the target address of the
return instruction, and for providing an override
indicator. The override indicator indicates a
predetermined value if the first prediction mispredicted
the target address for a first instance of the return
instruction. The program code also includes third program
code for providing branch control logic, coupled to the
return stack and the BTAC, for causing the microprocessor
to branch to the second prediction of the target address,
and not to the first prediction, for a second instance of
the return instruction, if the override indicator indicates
the predetermined value.

[0017] An advantage of the present invention is that it 
potentially improves branch prediction accuracy of programs 
that engage in non-standard call/return sequences. 
Simulations have shown improved benchmark scores when the
override mechanism described in the embodiments herein is
employed. Additionally, the
advantage is realized with the addition of a small amount 
of hardware if a microprocessor already includes a BTAC and 
an alternate return instruction target address prediction 
mechanism. 
[0018] Other features and advantages of the present
invention will become apparent upon study of the remaining 
portions of the specification and drawings. 

BRIEF DESCRIPTION OF THE DRAWINGS 

[0019] FIGURE 1 is a block diagram of a pipelined 
microprocessor according to the present invention. 

[0020] FIGURE 2 is a flowchart illustrating operation of
the microprocessor of Figure 1 according to the present
invention.

[0021] FIGURE 3 is a flowchart illustrating operation of
the microprocessor of Figure 1 according to the present
invention.

[0022] FIGURE 4 is a flowchart illustrating operation of 
the microprocessor of Figure 1 according to the present 
invention. 

[0023] FIGURE 5 is a block diagram of a pipelined 
microprocessor according to an alternate embodiment of the 
present invention. 

[0024] FIGURE 6 is a flowchart illustrating operation of 
the microprocessor of Figure 5 according to an alternate 
embodiment of the present invention. 

DETAILED DESCRIPTION 

[0025] Referring now to Figure 1, a block diagram of a 
pipelined microprocessor 100 according to the present 
invention is shown. In one embodiment, microprocessor 100
comprises a microprocessor whose instruction set conforms
substantially to an x86 architecture instruction set,
including x86 CALL and RET instructions. However, the
present invention is not limited to x86 architecture
microprocessors, but rather may be employed in any
microprocessor that employs a return stack to predict
target addresses of return instructions.

[0026] Microprocessor 100 includes an instruction cache 
108. The instruction cache 108 caches instruction bytes 
from a system memory coupled to microprocessor 100. 
Instruction cache 108 caches lines of instruction bytes. 
In one embodiment, a cache line comprises 32 instruction
bytes. Instruction cache 108 receives a fetch
address 132 from a multiplexer 106. Instruction cache 108 
outputs a cache line of instruction bytes 186 specified by 
fetch address 132 if fetch address 132 hits in instruction 
cache 108. In particular, the cache line of instruction 
bytes 186 specified by fetch address 132 may include one or 
more return instructions. The instruction bytes 186 are 
piped down the microprocessor 100 pipeline via pipeline 
registers 121 and 123, as shown. Although only two 
pipeline registers 121 and 123 are shown for piping down 
instruction bytes 186, other embodiments may include more 
pipeline stages. 

[0027] Microprocessor 100 also includes an instruction 
decoder, referred to as F-stage instruction decoder 114, 
coupled to the output of pipeline register 123. 
Instruction decoder 114 receives instruction bytes 186 and 
related information and decodes the instruction bytes. In 
one embodiment, microprocessor 100 supports instructions of 
variable length. Instruction decoder 114 receives a stream
of instruction bytes, parses the stream into discrete
instructions, and determines the length of each
instruction. In particular, instruction decoder 114
generates a true value on a ret signal 154 to indicate that
it has decoded a return instruction. In one embodiment,
microprocessor 100 includes a reduced-instruction-set- 
computer (RISC) core that executes microinstructions, and 
instruction decoder 114 translates macroinstructions, such 
as x86 macroinstructions, into microinstructions of the 
native RISC instruction set. The microinstructions are 
piped down the microprocessor 100 pipeline via pipeline 
registers 125 and 127, as shown. Although only two 
pipeline registers 125 and 127 are shown for piping down 
the microinstructions, other embodiments may include more 
pipeline stages. For example, the stages may include a 
register file, an address generator, a data load/store 
unit, an integer execution unit, a floating-point execution 
unit, an MMX execution unit, an SSE execution unit, and an 
SSE-2 execution unit. 

[0028] Microprocessor 100 also includes branch 
resolution logic, referred to as E-stage branch resolution 
logic 124, coupled to the output of pipeline register 127. 
Branch resolution logic 124 receives branch instructions, 
including return instructions, as they migrate down the 
microprocessor 100 pipeline and makes a final determination 
of the target address of all branch instructions. Branch 
resolution logic 124 provides the correct branch 
instruction target address as an input to multiplexer 106 
on E-stage target address signal 148. Additionally, if a
target address was predicted for the branch instruction,
branch resolution logic 124 receives the predicted target
address. Branch resolution logic 124 compares the
predicted target address with the correct target address
148 and determines whether the target address was
mispredicted, such as by a BTAC array 102, a BTAC
return stack 104, or an F-stage return stack 116, which are
all described in detail below. If a target address
misprediction was made, branch resolution logic 124
generates a true value on a misprediction signal 158.

[0029] Microprocessor 100 also includes branch control 
logic 112, coupled to multiplexer 106. Branch control 
logic 112 generates a mux select signal 168 to control 
multiplexer 106 to select one of multiple input addresses, 
described below, to output as fetch address 132. The 
operation of branch control logic 112 is described in more 
detail below. 

[0030] Microprocessor 100 also includes an adder 182 
that receives fetch address 132 and increments fetch 
address 132 to provide a next sequential fetch address 162 
as an input to multiplexer 106. If no branch instructions 
are predicted or executed during a given clock cycle, 
branch control logic 112 controls multiplexer 106 to select 
next sequential fetch address 162. 

[0031] Microprocessor 100 also includes a branch target 
address cache (BTAC) array 102, coupled to receive fetch 
address 132. BTAC array 102 includes a plurality of 
storage elements, or entries, each for caching a branch 
instruction target address and related branch prediction 
information. When fetch address 132 is input to
instruction cache 108 and instruction cache 108
responsively provides the line of instruction bytes 186,
BTAC array 102 substantially concurrently provides a
prediction of whether a branch instruction is present in
the cache line 186, a predicted target address of the
branch instruction, and whether the branch instruction is a
return instruction. Advantageously, according to the
present invention, BTAC array 102 also provides an override
indicator for indicating whether the target address of the
return instruction should be predicted by the BTAC array
102 rather than by a return stack, as described below in
detail.

[0032] The target address 164 of the return instruction 
predicted by BTAC array 102 is provided as an input to a 
second multiplexer 126. The output of multiplexer 126, 
target address 144, is provided as an input to multiplexer 
106. Target address 144 is also piped down the
microprocessor 100 pipeline via pipeline registers 111 and
113, as shown. The output of pipeline register 113 is
referred to as target address 176. Although only two 
pipeline registers 111 and 113 are shown for piping down 
target address 144, other embodiments may include more 
pipeline stages. 

[0033] In one embodiment, BTAC array 102 is configured 
as a 2-way set associative cache capable of storing 4096 
target addresses and related information. However, the 
present invention is not limited to a particular embodiment 
of BTAC array 102. In one embodiment, the lower bits of 
fetch address 132 select one set, or row, in BTAC array 
102. An address tag is stored for each entry in BTAC array 
102 that indicates the upper address bits of the address of 
the branch instruction whose target address is stored in 
the corresponding entry. The upper bits of fetch address 
132 are compared with the address tags of each entry of the 
selected set. If the upper bits of fetch address 132 match
a valid address tag in the selected set, then a hit in BTAC 
array 102 occurs, which indicates that BTAC array 102 
predicts a branch instruction is present in the instruction 
cache line 186 selected by fetch address 132 and output by 
instruction cache 108 substantially concurrently with 
target address prediction 164. 
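
The set and tag selection just described can be sketched as a hypothetical Python model of a 4096-entry, 2-way set-associative BTAC, giving 2048 sets and 11 index bits. The model assumes that the 5 offset bits of a 32-byte cache line are excluded from the index, so the "lower bits" used for the index are the 11 bits just above the line offset; the exact bit positions, the replacement policy, and all names are assumptions. Only the predicted target is stored; the type and override bits are omitted for brevity.

```python
LINE_OFFSET_BITS = 5                    # assumed: 32-byte lines, offset not used for indexing
WAYS = 2
ENTRIES = 4096
NUM_SETS = ENTRIES // WAYS              # 2048 sets
INDEX_BITS = NUM_SETS.bit_length() - 1  # 11 index bits


class BtacArray:
    """2-way set-associative BTAC model (hypothetical field layout)."""

    def __init__(self):
        # Each way holds None (invalid) or a (tag, target) tuple.
        self.sets = [[None] * WAYS for _ in range(NUM_SETS)]

    @staticmethod
    def index_and_tag(fetch_address):
        # Lower bits (above the line offset) select the set; upper bits form the tag.
        index = (fetch_address >> LINE_OFFSET_BITS) & (NUM_SETS - 1)
        tag = fetch_address >> (LINE_OFFSET_BITS + INDEX_BITS)
        return index, tag

    def lookup(self, fetch_address):
        """Return the predicted target on a hit, or None on a miss."""
        index, tag = self.index_and_tag(fetch_address)
        for entry in self.sets[index]:
            if entry is not None and entry[0] == tag:
                return entry[1]
        return None

    def install(self, branch_address, target, way):
        # The replacement policy that chooses `way` is not specified here.
        index, tag = self.index_and_tag(branch_address)
        self.sets[index][way] = (tag, target)
```

For example, addresses 0x12345 and 0x22345 map to the same set with different tags, so installing them in different ways lets both hit, while 0x32345 in the same set misses.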

[0034] Each entry of BTAC array 102 also stores an 
indication of the type of branch instruction present in the 
instruction cache line 186 specified by fetch address 132. 
That is, BTAC array 102 also stores the type of the branch 
instruction whose predicted target address 164 is provided 
by BTAC array 102 to multiplexer 126. In particular, if 
the branch instruction type is a return instruction, BTAC 
array 102 generates a true value on a ret signal 138, which 
is provided to branch control logic 112. Additionally, 
BTAC array 102 outputs an override signal 136, discussed in 
detail below, which is also provided to branch control 
logic 112. In one embodiment, the branch instruction type 
field stored in each BTAC array 102 entry comprises two 
bits, which are encoded as shown in Table 1.

Type field   Branch instruction type
00           not RET or CALL
01           CALL
10           normal RET
11           override RET

Table 1.

[0035] In one embodiment, the most significant bit of
the branch type field is provided on ret signal 138 and the
least significant bit of the branch type field is provided
on override signal 136. In the case of a CALL instruction,
override signal 136 is not used. As may be observed, no
additional storage is required to accommodate the override
bit, since the type field was already two bits and only
three of the four possible states were being used.
The override signal 136 is piped down the microprocessor
100 pipeline via pipeline registers 101, 103, 105, and 107,
as shown. In particular, the output of pipeline register
103, denoted override_F signal 172, is provided to branch
control logic 112. Additionally, the output of pipeline
register 107 is denoted override_E signal 174. Although
only four pipeline registers 101, 103, 105, and 107 are
shown for piping down override signal 136, other
embodiments may include more pipeline stages.
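
The bit split of paragraph [0035] can be expressed as a small decoding helper. The following Python sketch is illustrative; the function and dictionary names are hypothetical, while the encodings come from Table 1.

```python
# Table 1 encodings of the 2-bit branch type field
TYPE_NAMES = {
    0b00: "not RET or CALL",
    0b01: "CALL",
    0b10: "normal RET",
    0b11: "override RET",
}


def split_type_field(type_field):
    """Return (ret_138, override_136) for a 2-bit branch type field."""
    ret_138 = bool(type_field & 0b10)       # most significant bit
    override_136 = bool(type_field & 0b01)  # least significant bit
    return ret_138, override_136


for code, name in TYPE_NAMES.items():
    ret_138, override_136 = split_type_field(code)
    # For a CALL (01) the least significant bit is set, but because ret_138
    # is false, override_136 is not used.
    print(f"{code:02b} {name:16} ret={ret_138} override={override_136}")
```

The CALL encoding drives the override output high, but since its ret output is false, the override value is ignored, which is what allows the otherwise unused encoding 11 to represent an override RET.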

[0036] In one embodiment, when branch resolution logic 
124 resolves a new call instruction, the target address of 
the call instruction is cached in BTAC array 102 along with 
a type field value indicating a CALL instruction. 
Similarly, when branch resolution logic 124 resolves a new 
return instruction, the target address of the return 
instruction is cached in BTAC array 102 along with a type 
field value indicating a normal RET instruction. 

[0037] Microprocessor 100 also includes a return stack
104, referred to as BTAC return stack 104, coupled to
receive ret signal 138 from BTAC array 102. BTAC return
stack 104 caches return addresses specified by call
instructions in a last-in-first-out manner. In one
embodiment, when branch resolution logic 124 resolves a new
call instruction, the return address specified by the call 
instruction is pushed onto the top of BTAC return stack 
104. When BTAC array 102 indicates via ret signal 138 that 
a return instruction is present in the cache line 186 
specified by fetch address 132, the return address at the 
top of BTAC return stack 104 is popped and provided as a 
target address 142 to multiplexer 126. Branch control 
logic 112 controls multiplexer 126 via a control signal 184 
to select the target address 142 predicted by BTAC return 
stack 104 if ret signal 138 is true and override signal 136 
is false. Otherwise, branch control logic 112 controls 
multiplexer 126 via control signal 184 to select the target 
address 164 predicted by BTAC array 102. 
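
A hypothetical Python sketch of how BTAC return stack 104 and multiplexer 126 interact, per paragraph [0037], follows. As that paragraph describes, the pop is conditioned only on ret signal 138, so the stack is popped even when the override indicator causes its prediction to be ignored. Falling back to the BTAC array target when the stack is empty is an assumption, since empty-stack behavior is not specified.

```python
def target_address_144(ret_138, override_136, btac_return_stack, btac_target_164):
    """Model of BTAC return stack 104 feeding multiplexer 126 (hypothetical helper).

    btac_return_stack is a list whose last element is the top of the stack.
    """
    rs_target_142 = None
    if ret_138 and btac_return_stack:
        # The top entry is popped whenever the BTAC predicts a RET,
        # even if the override indicator will cause it to be ignored.
        rs_target_142 = btac_return_stack.pop()
    # Control signal 184: select 142 only for a RET whose override bit is clear.
    if ret_138 and not override_136 and rs_target_142 is not None:
        return rs_target_142
    # Otherwise select the BTAC array prediction (assumed also for an empty stack).
    return btac_target_164


stack = [0x1005]
print(hex(target_address_144(True, True, stack, 0x3000)), stack)
```

With the override bit set, the BTAC array target 0x3000 is selected and the stack is still emptied by the pop.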

[0038] Microprocessor 100 also includes a second return 
stack 116, referred to as F-stage return stack 116, coupled 
to receive ret signal 154 from instruction decoder 114. F- 
stage return stack 116 caches return addresses specified by 
call instructions in a last-in-first-out manner. In one 
embodiment, when branch resolution logic 124 resolves a new 
call instruction, the return address specified by the call 
instruction is pushed onto the top of F-stage return stack 
116. When instruction decoder 114 indicates via ret signal 
154 that a return instruction has been decoded, the return 
address at the top of F-stage return stack 116 is popped 
and provided as a target address 146 to multiplexer 106. 
[0039] Microprocessor 100 also includes a comparator 
118. Comparator 118 compares F-stage return stack 116 
target address 146 and piped-down target address 176. 
Comparator 118 generates a true value on a mismatch signal 
152, which is provided to branch control logic 112, if
F-stage return stack 116 target address 146 and piped-down
target address 176 do not match. Branch control logic 112 
controls multiplexer 106 via control signal 168 to select 
F-stage return stack 116 target address 146 if ret signal 
154 is true, if override_F 172 is false, and if mismatch 
signal 152 is true. Otherwise, branch control logic 112 
controls multiplexer 106 via control signal 168 to select 
one of its other inputs. 
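
The selection condition of paragraph [0039] reduces to a single Boolean expression. In this hypothetical sketch, returning None stands in for multiplexer 106 selecting one of its other inputs, whose choice is not modeled; the function name is illustrative.

```python
def fstage_redirect(ret_154, override_f_172, fstage_target_146, piped_target_176):
    """Return the address multiplexer 106 should select, or None if the
    F-stage return stack does not redirect fetch (hypothetical helper)."""
    mismatch_152 = fstage_target_146 != piped_target_176  # comparator 118
    if ret_154 and not override_f_172 and mismatch_152:
        return fstage_target_146
    return None  # one of multiplexer 106's other inputs is selected instead
```

A decoded RET redirects fetch only when its earlier prediction disagrees with the F-stage return stack and the override bit is clear; with the override bit set, the earlier BTAC array prediction stands.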

[0040] Microprocessor 100 also includes BTAC update 
logic 122, coupled to branch resolution logic 124 and BTAC 
array 102. BTAC update logic 122 receives misprediction 
signal 158 from branch resolution logic 124. BTAC update 
logic 122 receives override_E signal 174 from pipeline
register 107. BTAC update logic 122 generates a BTAC
update request signal 134, which is provided to BTAC array
102. BTAC update request signal 134 includes information
for updating an entry in BTAC array 102. In one
embodiment, BTAC update request signal 134 includes a
target address of a branch instruction, the address of the
branch instruction, and a value for the type field.

[0041] When branch resolution logic 124 resolves a new 
branch instruction, BTAC update logic 122 generates a BTAC 
update request 134 to update BTAC array 102 with the 
information for predicting the target address and type of 
the new branch instruction on a subsequent occurrence, or 
instance, of the branch instruction in an instruction cache 
line specified by fetch address 132. Additionally, if 
misprediction signal 158 is true, BTAC update logic 122 
generates a BTAC update request 134 to update the entry in 
BTAC array 102 associated with the branch instruction. In
particular, if the branch instruction is a return 
instruction that was mispredicted by BTAC return stack 104 
or by F-stage return stack 116, BTAC update logic 122 
assigns the override bit in the BTAC array 102 entry to a 
predetermined value to indicate that the BTAC return stack 
104 prediction 142 and F-stage return stack 116 prediction 
146 should be overridden by the BTAC array 102 prediction 
164 on the next occurrence or instance of the return 
instruction. In one embodiment, the type field is set to 
an override RET value, or 11 as specified in Table 1 above. 
Conversely, if the branch instruction is a return 
instruction that was mispredicted by BTAC array 102 because 
the override bit was set, BTAC update logic 122 assigns the 
override bit in the BTAC array 102 entry to a predetermined 
value to indicate that the BTAC return stack 104 prediction 
142 and, if necessary, the F-stage return stack 116 
prediction 146 should be selected rather than the BTAC 
array 102 prediction 164 on the next occurrence or instance 
of the return instruction. In one embodiment, the type 
field is set to a normal RET value, or 10 as specified in 
Table 1 above. The operation of microprocessor 100 will
now be described more fully with respect to Figures 2
through 4.
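
The type-field update of paragraph [0041] for a resolved return instruction can be sketched as follows. The function name is hypothetical, and leaving the field unchanged when the prediction was correct is an assumption, since only the mispredicted cases are described. The sketch uses override_E signal 174 to tell which predictor supplied the target: if the override bit was set, the BTAC array 102 prediction was used.

```python
NORMAL_RET = 0b10
OVERRIDE_RET = 0b11


def next_ret_type(type_field, misprediction_158, override_e_174):
    """Type-field value BTAC update logic 122 would write (hypothetical helper)."""
    if type_field not in (NORMAL_RET, OVERRIDE_RET) or not misprediction_158:
        # Not a RET, or predicted correctly: assumed to leave the field unchanged.
        return type_field
    if override_e_174:
        # The BTAC array prediction was used (override set) and was wrong:
        # let the return stacks predict the next occurrence.
        return NORMAL_RET
    # A return stack prediction was used and was wrong:
    # let the BTAC array predict the next occurrence.
    return OVERRIDE_RET
```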

[0042] Referring now to Figure 2, a flowchart 
illustrating operation of microprocessor 100 of Figure 1 
according to the present invention is shown. Figure 2 
describes the operation of microprocessor 100 in response 
to a prediction of a return instruction by the BTAC array 
102 and the BTAC return stack 104 of Figure 1. Flow begins
at block 202. 

[0043] At block 202, fetch address 132 of Figure 1 is 
applied to instruction cache 108 of Figure 1 and BTAC array 
102 in parallel. In response, instruction cache 108 
provides the cache line of instruction bytes 186 of Figure 
1 to the microprocessor 100 pipeline in response to fetch 
address 132. Flow proceeds to block 204. 

[0044] At block 204, BTAC array 102 predicts, based on 
fetch address 132, via ret signal 138 that a return 
instruction is present in the instruction cache line 186
provided by instruction cache 108 to the microprocessor 100 
pipeline, and BTAC array 102 provides target address 164 to 
multiplexer 126. Flow proceeds to decision block 206. 
[0045] At decision block 206, branch control logic 112 
determines whether override indicator 136 is set. If so, 
flow proceeds to block 212; otherwise, flow proceeds to 
block 208. 

[0046] At block 208, branch control logic 112 controls 
multiplexer 126 and multiplexer 106 to select the BTAC 
return stack target address 142 as fetch address 132 to 
branch microprocessor 100 thereto. Flow ends at block 208. 
[0047] At block 212, branch control logic 112 controls 
multiplexer 126 and multiplexer 106 to select the BTAC 
array target address 164 as fetch address 132 to branch 
microprocessor 100 thereto. Flow ends at block 212. 
[0048] As may be observed from Figure 2, if the override 
indicator 136 is set, such as during a previous occurrence 
of the return instruction as described below with respect 
to block 408, branch control logic 112 advantageously 
overrides the BTAC return stack 104 and alternatively
selects the target address 164 predicted by the BTAC array 
102, thereby avoiding an almost certain misprediction by 
the BTAC return stack 104 if the running program is 
executing a non-standard call/return sequence. 
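
To illustrate the benefit described in paragraph [0048], the following hypothetical simulation replays the first non-standard sequence of paragraph [0009] (CALL, PUSH, RET) ten times, with and without the override mechanism. It combines the Figure 2 selection with the update rule of paragraph [0041]. The addresses are invented, and recreating both stacks on every pass is a simplification that ignores stack depth and the leftover return address; it is not a performance model.

```python
def count_mispredictions(iterations, use_override):
    """Replay CALL, PUSH, RET repeatedly; count RET target mispredictions."""
    after_call = 0x1005  # hypothetical address of the instruction after the CALL
    pushed = 0x3000      # hypothetical address placed on the stack by the PUSH
    override = False     # override bit in the RET's BTAC entry
    btac_target = None   # target address cached in that BTAC entry
    mispredictions = 0
    for _ in range(iterations):
        memory_stack, return_stack = [], []  # simplification: fresh stacks each pass
        memory_stack.append(after_call)      # CALL
        return_stack.append(after_call)
        memory_stack.append(pushed)          # PUSH (not seen by the return stack)
        # RET, following Figure 2: decision block 206, then block 208 or 212
        rs_target = return_stack.pop()
        predicted = btac_target if use_override and override else rs_target
        actual = memory_stack.pop()          # resolved target
        if predicted != actual:
            mispredictions += 1
            if use_override:
                # A return stack miss sets the bit; a BTAC array miss clears it.
                override = not override
        btac_target = actual
    return mispredictions


print(count_mispredictions(10, use_override=False))
print(count_mispredictions(10, use_override=True))
```

Without the override, every RET is mispredicted (10 of 10); with it, only the first occurrence is mispredicted (1 of 10), after which the BTAC array target is selected.
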
[0049] Referring now to Figure 3, a flowchart 
illustrating operation of microprocessor 100 of Figure 1 
according to the present invention is shown. Figure 3 
describes the operation of microprocessor 100 in response 
to a prediction of a return instruction, such as the return 
instruction predicted in Figure 2, by F-stage return stack 
116 of Figure 1. Flow begins at block 302. 

[0050] At block 302, F-stage instruction decoder 114 of 
Figure 1 decodes the return instruction that was present in 
the instruction cache line 186 output by instruction cache 
108 in response to fetch address 132 that was applied to 
BTAC array 102 in block 202 of Figure 2, and subsequently 
predicted by BTAC array 102 and BTAC return stack 104 as 
described with respect to Figure 2. In response to F-stage 
instruction decoder 114 indicating via ret signal 154 that 
a return instruction was decoded, F-stage return stack 116 
provides its predicted target address 146 to multiplexer 
106. Flow proceeds to block 304. 

[0051] At block 304, comparator 118 of Figure 1 compares 
F-stage return stack predicted target address 146 and 
target address 176 of Figure 1. Comparator 118 generates a 
true value on mismatch signal 152 of Figure 1 if addresses 
146 and 176 do not match. Flow proceeds to decision block 
306. 
[0052] At decision block 306, branch control logic 112 
examines mismatch signal 152 to determine whether a 
mismatch occurred. If so, flow proceeds to decision block 
308; otherwise, flow ends. 

[0053] At decision block 308, branch control logic 112 
examines override_F signal 172 of Figure 1 to determine 
whether override_F bit 172 is set. If so, flow ends, i.e., 
the branch to the BTAC array target address 164 performed 
at block 212 of Figure 2 is not superseded by the F-stage 
return stack predicted target address 146. If override_F 
bit 172 is clear, flow proceeds to block 312. 
[0054] At block 312, branch control logic 112 controls 
multiplexer 106 to select the F-stage return stack 
predicted target address 146 to branch microprocessor 100 
thereto. In one embodiment, before branching to the F- 
stage return stack predicted target address 146, 
microprocessor 100 flushes the instructions in the stages 
above the F-stage. Flow ends at block 312. 

[0055] As may be observed from Figure 3, if override_F 
indicator 172 is set, such as during a previous occurrence 
of the return instruction as described below with respect 
to block 408, branch control logic 112 advantageously 
overrides the F-stage return stack 116 and alternatively 
sustains the target address 164 predicted by the BTAC array 
102, thereby avoiding an almost certain misprediction by 
the F-stage return stack 116 if the running program is 
executing a non-standard call/return sequence. 
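Figure 3 reduces to a comparison followed by two tests. In this hypothetical Python sketch, the function name and the (redirect, address) return convention are assumptions made for illustration, and the flush of the stages above the F-stage is represented only by the redirect flag.

```python
from typing import Optional, Tuple


def f_stage_check(f_stage_target: int, piped_target: int,
                  override_f: bool) -> Tuple[bool, Optional[int]]:
    """Sketch of Figure 3, blocks 304-312.

    Returns (redirect, new_fetch_address). redirect is True only when the
    F-stage return stack prediction should replace the earlier prediction.
    """
    mismatch = f_stage_target != piped_target  # comparator 118 -> mismatch 152
    if not mismatch:
        return False, None   # block 306: predictions agree, no action
    if override_f:
        return False, None   # block 308: keep the BTAC array target from block 212
    # Block 312: flush the stages above the F-stage and branch to address 146.
    return True, f_stage_target


print(f_stage_check(0x1000, 0x4000, override_f=True))   # (False, None)
print(f_stage_check(0x1000, 0x4000, override_f=False))  # (True, 4096)
```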
[0056] Referring now to Figure 4, a flowchart 
illustrating operation of microprocessor 100 of Figure 1 
according to the present invention is shown. Figure 4 
describes the operation of microprocessor 100 in response 
to resolution of a return instruction, such as a previous 
instance of the return instruction predicted and decoded in 
Figures 2 and 3. Flow begins at block 402. 

[0057] At block 402, E-stage branch resolution logic 124 
of Figure 1 resolves a return instruction. That is, branch 
resolution logic 124 finally determines the correct target 
address 148 of Figure 1 of the return instruction. In 
particular, branch resolution logic 124 generates a true 
value on misprediction signal 158 of Figure 1 if 
microprocessor 100 was caused to branch to an incorrect 
target address of the return instruction. Flow proceeds to 
decision block 404. 

[0058] At decision block 404, BTAC update logic 122 
examines misprediction signal 158 to determine whether the 
return instruction target address was mispredicted. If so, 
flow proceeds to decision block 406; otherwise, flow ends. 
[0059] At decision block 406, BTAC update logic 122 
examines override_E signal 174 to determine whether the 
override_E bit 174 is set. If so, flow proceeds to block 
408; otherwise, flow proceeds to block 412. 

[0060] At block 408, BTAC update logic 122 generates a 
BTAC update request 134 to clear the override bit for the 
entry that mispredicted the return instruction. The 
present inventors have observed that a given return 
instruction may be reached from multiple program paths. 
That is, sometimes a return may be reached from a non- 
standard code path, such as one of the code paths described 
above, which always cause a return stack to mispredict the 
target address of the return instruction; however, the same 
return instruction may also be reached from a code path 
that constitutes a standard call/return pair sequence. In 
the latter case, return stacks generally more accurately 
predict the target address of a return instruction. 
Consequently, if a misprediction occurs when the override 
bit is set, BTAC update logic 122 clears the override bit 
in block 408 because it is anticipated that the standard 
call/return pair sequence is predominating. Flow proceeds 
to block 414. 

[0061] At block 412, BTAC update logic 122 generates a 
BTAC update request 134 to set the override bit in the 
appropriate entry in BTAC array 102 since F-stage return 
stack 116 mispredicted the return instruction target 
address. By setting the BTAC 102 override bit for the 
entry storing the prediction for the return instruction, 
the present invention advantageously solves the problem 
created by a non-standard call/return sequence. That is, 
on the next occurrence of the return instruction, 
microprocessor 100 branches to the BTAC array target 
address 164 rather than to the BTAC return stack target 
address 142 or the F-stage return stack predicted target 
address 146, either of which would incorrectly predict the 
return instruction target address. Flow proceeds to block 414. 

[0062] At block 414, microprocessor 100 flushes its 
pipeline, since the misprediction of the return instruction 
target address caused the incorrect instructions to be 
fetched into the microprocessor 100 pipeline from 
instruction cache 108; hence, those instructions must not 
be executed. Subsequently, branch control logic 112 
controls multiplexer 106 to select the E-stage target 
address 148 to branch microprocessor 100 thereto in order 
to fetch the correct target instructions. Flow ends at 
block 414. 
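The update policy of Figure 4 toggles the override bit on each misprediction: a return stack miss sets it, and a BTAC array miss clears it. The Python sketch below uses hypothetical names and returns the new bit value rather than modeling BTAC update request 134 or the BTAC array write itself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ResolveAction:
    override: bool            # new value of the entry's override bit
    flush: bool               # block 414: flush the pipeline
    redirect: Optional[int]   # correct target address 148, if redirecting


def resolve_return(predicted: int, correct: int, override_e: bool) -> ResolveAction:
    """Sketch of Figure 4, blocks 402-414, for one resolved return instruction."""
    if predicted == correct:
        # Block 404: no misprediction (signal 158 false); the bit is left alone.
        return ResolveAction(override=override_e, flush=False, redirect=None)
    if override_e:
        # Block 408: the BTAC array mispredicted while overriding, so assume the
        # standard call/return pairing now predominates and clear the bit.
        new_override = False
    else:
        # Block 412: the return stack mispredicted, so set the bit.
        new_override = True
    # Block 414: discard wrongly fetched instructions; fetch from the correct target.
    return ResolveAction(override=new_override, flush=True, redirect=correct)


print(resolve_return(predicted=0x1000, correct=0x4000, override_e=False))
# ResolveAction(override=True, flush=True, redirect=16384)
```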

[0063] In one embodiment, block 412 updates the type 
field of the BTAC array 102 entry with a binary value of 11 
and block 408 updates the type field of the BTAC array 102 
entry with a binary value of 10, according to Table 1 
above. 
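Paragraph [0063] suggests that, in that embodiment, the override indicator is carried in the type field of the BTAC array entry rather than in a separate bit. A minimal sketch follows. It assumes only the two codes named above and treats every other code as outside the model, since Table 1 is not reproduced in this section. The constant and function names are hypothetical.

```python
# Two-bit type-field codes named in paragraph [0063]. Other codes that
# Table 1 may define are deliberately not modeled here.
TYPE_RETURN_OVERRIDE = 0b11  # written at block 412 (override set)
TYPE_RETURN = 0b10           # written at block 408 (override cleared)


def encode_return_type(override: bool) -> int:
    """Type-field value to write for a return entry."""
    return TYPE_RETURN_OVERRIDE if override else TYPE_RETURN


def decode_override(type_field: int) -> bool:
    """Recover the override indicator from a return entry's type field."""
    if type_field not in (TYPE_RETURN, TYPE_RETURN_OVERRIDE):
        raise ValueError(f"type field {type_field:02b} is not a return code modeled here")
    return type_field == TYPE_RETURN_OVERRIDE


print(decode_override(encode_return_type(True)))  # True
```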

[0064] As may be observed from Figures 2 through 4, the 
override indicator potentially improves the prediction 
accuracy for a return instruction. If the microprocessor 
senses that a return instruction may have been executed as 
part of a non-standard call/return sequence by the fact 
that the return stack mispredicted the target address of 
the return instruction, then the microprocessor sets the 
override indicator associated with the return instruction 
in the BTAC, and on the next instance of the return 
instruction, the microprocessor uses a prediction mechanism 
other than the return stack to predict the return 
instruction target address, since the microprocessor 
determines from the override indicator that the return 
stack is likely to mispredict the target address for the 
present occurrence of the return instruction. Conversely, 
although a return instruction may have been previously 
executed as part of a non-standard call/return sequence, if 
the microprocessor senses that the return instruction may 
subsequently have been executed as part of a standard 
call/return sequence by the fact that the BTAC array 
mispredicted the return instruction target address, then 
the microprocessor clears the override indicator associated 
with the return instruction in the BTAC, and on the next 
instance of the return instruction, the microprocessor uses 
the return stack to predict the return instruction target 
address, since the microprocessor determines from the 
override indicator that the return stack is likely to 
correctly predict the target address for the present 
occurrence of the return instruction. 
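To make this learning behavior concrete, the sketch below replays successive occurrences of a single return instruction through one return stack and one alternate predictor. It is a simplification under stated assumptions. The two return stacks of Figure 1 are collapsed into one, and the BTAC array is assumed to hold the most recently resolved target of the return. The function name and trace values are hypothetical.

```python
from typing import List, Optional, Tuple


def simulate_return(occurrences: List[Tuple[int, int]]) -> List[bool]:
    """Replay occurrences of one return instruction.

    Each occurrence is (return_stack_prediction, correct_target). Returns a
    list of booleans, True where the target was predicted correctly.
    """
    override = False                    # override indicator for this return
    btac_target: Optional[int] = None   # assumption: last resolved target
    hits = []
    for stack_prediction, correct in occurrences:
        # With the bit set, use the BTAC array target; otherwise the return stack.
        predicted = btac_target if override else stack_prediction
        hit = predicted == correct
        if not hit:
            # A return stack miss sets the bit; a BTAC array miss clears it.
            override = not override
        btac_target = correct
        hits.append(hit)
    return hits


# Non-standard sequence: the return stack keeps predicting 0x1000, but the
# return actually goes to 0x4000. The return is then reached via a standard
# call/return pair, where the return stack is right (0x2000).
trace = [(0x1000, 0x4000)] * 3 + [(0x2000, 0x2000)] * 2
print(simulate_return(trace))  # [False, True, True, False, True]
```

For the same trace, always trusting the return stack would yield [False, False, False, True, True]. Under these assumptions, the override indicator turns the repeated non-standard returns into hits at the cost of one extra miss when the standard path resumes.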

[0065] Referring now to Figure 5, a block diagram of a 
pipelined microprocessor 500 according to an alternate 
embodiment of the present invention is shown. 
Microprocessor 500 of Figure 5 is similar to microprocessor 
100 of Figure 1, except that it does not include BTAC 
return stack 104 or multiplexer 126. Consequently, the 
predicted target address 164 output by BTAC array 102 is 
provided directly to multiplexer 106, rather than through 
multiplexer 126. Additionally, BTAC array 102 target 
address 164, rather than target address 144 of Figure 1, is 
provided as the input to pipeline register 111 and piped 
down as target address 176. 

[0066] Referring now to Figure 6, a flowchart 
illustrating operation of microprocessor 500 of Figure 5 
according to an alternate embodiment of the present 
invention is shown. Figure 6 is similar to Figure 2, 
except that decision block 206 and block 208 are not 
present; hence, flow proceeds from block 204 to block 212. 
Therefore, when BTAC array 102 predicts a return 
instruction via ret signal 138, branch control logic 112 
always operates to cause microprocessor 500 to branch to 
the target address 164 predicted by BTAC array 102, since 
BTAC return stack 104 and multiplexer 126 of microprocessor 
100 of Figure 1 are not present in microprocessor 500 of 
Figure 5. 

[0067] Microprocessor 500 of Figure 5 also operates 
according to the flowcharts of Figures 3 and 4. It is 
noted that since BTAC return stack 104 is not present in 
microprocessor 500, piped-down target address 176 is always 
the BTAC array 102 target address 164; hence, the 
comparison performed in block 304 between F-stage return 
stack 116 target address 146 and target address 176 is 
always a comparison with piped-down BTAC array 102 target 
address 164. 
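For microprocessor 500, the combined effect of Figures 6 and 3 on a predicted return can be summarized as follows. This is a hypothetical behavioral sketch (the function name is illustrative) that ignores timing and treats the F-stage redirect and its flush as a simple replacement of the fetch address.

```python
def mp500_return_target(btac_array_target: int, f_stage_target: int,
                        override_f: bool) -> int:
    """Final fetch target for a predicted return in the Figure 5 embodiment."""
    # Figure 6: with no BTAC return stack, block 204 always leads to block 212.
    fetch = btac_array_target
    # Figure 3: piped-down target address 176 is always BTAC array target 164.
    if f_stage_target != btac_array_target and not override_f:
        fetch = f_stage_target  # block 312: the F-stage return stack redirects
    return fetch


print(hex(mp500_return_target(0x4000, 0x1000, override_f=False)))  # 0x1000
```

As read here, the override bit in this embodiment chooses between the BTAC array and the F-stage return stack. When the bit is clear, the return stack prediction still prevails, but one stage later than the initial BTAC prediction.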

[0068] Although the present invention and its objects, 
features and advantages have been described in detail, 
other embodiments are encompassed by the invention. For 
example, although embodiments have been described wherein 
the microprocessor has two return stacks, the 
microprocessor may have other numbers of return stacks, 
such as only a single return stack, or more than two return 
stacks. Furthermore, although embodiments are described in 
which the BTAC is the alternate target address prediction 
mechanism for overriding the return stack in addition to 
storing the override bit associated with the return 
instruction mispredicted by the return stack, other 
alternate target address prediction mechanisms may be 
employed, such as a branch target buffer. 

[0069] Also, although the present invention and its 
objects, features and advantages have been described in 
detail, other embodiments are encompassed by the invention. 
In addition to implementations of the invention using 
hardware, the invention can be implemented in computer 
readable code (e.g., computer readable program code, data, 
etc.) embodied in a computer usable (e.g., readable) 
medium. The computer code causes the enablement of the 
functions or fabrication or both of the invention disclosed 
herein. For example, this can be accomplished through the 
use of general programming languages (e.g., C, C++, JAVA, 
and the like); GDSII databases; hardware description 
languages (HDL) including Verilog HDL, VHDL, Altera HDL 
(AHDL), and so on; or other programming and/or circuit 
(i.e., schematic) capture tools available in the art. The 
computer code can be disposed in any known computer usable 
(e.g., readable) medium including semiconductor memory, 
magnetic disk, optical disk (e.g., CD-ROM, DVD-ROM, and the 
like), and as a computer data signal embodied in a computer 
usable (e.g., readable) transmission medium (e.g., carrier 
wave or any other medium including digital, optical or 
analog-based medium). As such, the computer code can be 
transmitted over communication networks, including 
the Internet and intranets. It is understood that the 
invention can be embodied in computer code (e.g., as part 
of an IP (intellectual property) core, such as a 
microprocessor core, or as a system-level design, such as a 
System on Chip (SOC)) and transformed to hardware as part 
of the production of integrated circuits. Also, the 
invention may be embodied as a combination of hardware and 
computer code. 

[0070] Finally, those skilled in the art should 
appreciate that they can readily use the disclosed 
conception and specific embodiments as a basis for 
designing or modifying other structures for carrying out 
the same purposes of the present invention without 
departing from the spirit and scope of the invention as 
defined by the appended claims. 



We claim: 



